Data analysis in R

In this lab, we’ll use some of the functions and methods we’ve looked at previously to carry out some simple data analysis for a couple of datasets. You’ll need the following files for this lab (all should be available on the Google drive):

Example 1: Georgia income data

In the first example, we’ll explore variations in median income in Georgia at the county level. This example is modified from Alex Comber’s GEOG 3915 GeoComputation class. We’ll start with some simple exploration and then move on to running some simple statistical analysis. First load the libraries that we’ll need (these should all have been installed in previous labs):

library(tidyverse)
library(sf)
library(tmap)

Next, we’ll read in the data. All the information we need is held in the georgia shapefile, so load this with st_read():

georgia <- st_read("./data/georgia/georgia.shp", quiet = TRUE)

This shapefile contains a number of variables for the counties, including the percentage of the population in each county that:

  • is rural (PctRural)
  • has a college degree (PctBach)
  • is elderly (PctEld)
  • is foreign born (PctFB)
  • is classed as being in poverty (PctPov)
  • is black (PctBlack)

as well as the median income of the county (MedInc, in dollars).

Check to make sure that the coordinate reference system is set:

st_crs(georgia)
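If st_crs() returns NA, the CRS was not read from the accompanying .prj file and needs to be set manually. A minimal sketch, assuming (purely for illustration) that the coordinates are WGS84 geographic (EPSG:4326) - check the data documentation for the real CRS before relying on this:

```r
# st_set_crs() only relabels the data with a CRS; it does NOT reproject
# the coordinates. The EPSG code here is an illustrative assumption.
if (is.na(st_crs(georgia))) {
  georgia <- st_set_crs(georgia, 4326)
}

# To actually reproject into a different CRS, use st_transform() instead:
# georgia_utm <- st_transform(georgia, 32616)  # UTM zone 16N (hypothetical choice)
```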

Exploration

The variable we are interested in is the median income (MedInc). Remember you can check the column names with names() or the data structure with str(). Let’s map this out to see if there’s a pattern:

tm_shape(georgia) +
  tm_fill("MedInc", palette = 'viridis') +
  tm_layout(legend.outside = TRUE)

There’s a reasonably strong N-S gradient in income values, with higher values around Atlanta in the north and Savannah on the southeastern coast. The values are in $/yr; we’ll rescale them here to make them a little more manageable:

georgia <- georgia %>%
  mutate(MedInc000 = MedInc / 1000)

Let’s get some summary statistics on this variable:

summary(georgia$MedInc000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.46   29.77   34.31   37.15   41.20   81.60
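The same figures (plus the standard deviation, which summary() omits) can be computed with dplyr; a quick sketch, using st_drop_geometry() to avoid carrying the geometry column through the calculation:

```r
# Summary statistics for the rescaled median income
georgia %>%
  st_drop_geometry() %>%
  summarise(mean_inc = mean(MedInc000),
            sd_inc   = sd(MedInc000),
            med_inc  = median(MedInc000))
```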

And make a histogram showing the distribution:

ggplot(georgia, aes(x = MedInc000)) +
  geom_histogram(col = 'lightgray', fill = 'darkorange') +
  scale_x_continuous("Median Income $000s") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of median income is right-skewed: most of the values fall between about $22K and $40K, but there are a few higher values. Normally, we would log-transform these values before working with them further, as this reduces the skew. We’ll skip that here to make some of the results more interpretable, but we can see what difference it would make by changing the x-axis to a log scale:

ggplot(georgia, aes(x = MedInc000)) +
  geom_histogram(col = 'lightgray', fill = 'darkorange') +
  scale_x_log10("Median Income $000s") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Hotspot analysis

Very briefly, we’ll also demonstrate a simple form of spatial exploration with hotspot analysis. This is based on the Getis-Ord \(G\) statistic, and indicates if there are parts of a region that have much higher (hotspot) or much lower (coldspot) values than might be expected. If you want to run this, you will need to install the spdep library, and then carry out the following steps:

  • Construct a neighborhood function with poly2nb - this identifies which spatial locations are considered to be neighbors of each other
  • Convert this into a spatial weight matrix with nb2listw - a numeric representation of the neighborhoods
  • Use this to calculate the \(G\) statistic with localG
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
nb <- poly2nb(georgia)
lw <- nb2listw(nb)
georgia_localG <- localG(georgia$MedInc000, 
                         listw = lw)

Once run, we can plot the map, which, not too surprisingly, shows hotspots around Atlanta and on the coast.

georgia$lG <- georgia_localG
tm_shape(georgia) +
  tm_fill("lG", palette = "-RdBu") +
  tm_borders('lightgray') +
  tm_layout(main.title = "Getis Ord G Values")
## Variable(s) "lG" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
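The local \(G\) values returned by localG() are z-scores, so a quick (if rough) way to flag hot and cold spots is to compare them to the usual two-sided 5% threshold of ±1.96. A sketch, with no multiple-testing correction applied:

```r
# Flag counties whose local G z-score passes +/-1.96
# (uncorrected two-sided 5% threshold - illustrative only)
georgia$hotspot <- case_when(
  georgia$lG >  1.96 ~ "Hotspot",
  georgia$lG < -1.96 ~ "Coldspot",
  TRUE               ~ "Not significant"
)
table(georgia$hotspot)
```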

Statistical tests

R comes with a large number of statistical tests. We’ll just look here at the simplest, the \(t\)-test. The format of the output is broadly similar for most of these tests, so the interpretation we make here should transfer to other tests. The \(t\)-test is a test for a significant difference of means between two groups. We’ll create two groups using the PctBach variable (the percent of the population in each county with a higher education degree). We’ll split the counties by whether they are above the median PctBach value for the state:

median(georgia$PctBach)
## [1] 9.4

We’ll use a combination of mutate and the ifelse function to create a new, binary variable, using 10 (a round number close to the median) as the split point:

georgia <- georgia %>%
  mutate(higher_ed = ifelse(PctBach > 10, "High", "Low"))

And we can use boxplots to examine the difference in median income for these two groups:

ggplot(georgia, aes(x = higher_ed, y = MedInc000)) +
  geom_boxplot() +
  scale_y_continuous("Med Inc $000s") +
  theme_bw()

We’ll now run a \(t\)-test. For this, we need to define the variable containing the groups (higher_ed), and the variable we want to run the test on (MedInc000).

We define this using R’s formula syntax. This syntax is used across a lot of test and modeling functions and is written as y ~ x, where y is the dependent variable (the one we want to test) and x is the independent (the one we want to use to run the test). We’ll see this again in the next section:

t.test(MedInc000 ~ higher_ed, georgia)
## 
##  Welch Two Sample t-test
## 
## data:  MedInc000 by higher_ed
## t = 5.2655, df = 98.429, p-value = 8.219e-07
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
##   5.467864 12.081624
## sample estimates:
## mean in group High  mean in group Low 
##           42.22430           33.44955

This test gives a lot of output, but the main parts are:

  • the test statistic (t): this is used in running the test (here a standardized difference)
  • the \(p\)-value: the significance level of the test.

When \(t\) is high and the \(p\)-value is low (\(<0.05\)) as here, this indicates a significant difference between the two groups.
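The result of t.test() is a list-like htest object, so individual pieces can be pulled out by name if you want to report them programmatically; a short sketch:

```r
# Store the test result and extract its components
tt <- t.test(MedInc000 ~ higher_ed, georgia)
tt$statistic  # the t value
tt$p.value    # the p-value
tt$conf.int   # 95% confidence interval for the difference in means
```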

Linear model

Next, we’ll build a simple regression model of the income values using the original values of percent higher education in each county. I’m sure you’re familiar with this, but as a quick recap, this models the changes in a dependent variable as a function of an intercept (\(\beta_0\)) and a slope (\(\beta_1\)) times a covariate (PctBach):

\[ y = \beta_0 + \beta_1 x + e \]

The lm() function is used in R to build simple linear regression models. This again uses the formula syntax described above, and we need to specify the R object that contains the variables for the model (georgia). We’ll fit this now, and use the summary() function to look at the results:

fit_lm1 <- lm(MedInc000 ~ PctBach, data = georgia) 
summary(fit_lm1)
## 
## Call:
## lm(formula = MedInc000 ~ PctBach, data = georgia)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.124  -6.025  -1.905   3.416  39.969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5913     1.5476  17.182  < 2e-16 ***
## PctBach       0.9643     0.1255   7.684 1.57e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.986 on 157 degrees of freedom
## Multiple R-squared:  0.2733, Adjusted R-squared:  0.2687 
## F-statistic: 59.04 on 1 and 157 DF,  p-value: 1.566e-12

There’s a lot of output here, but the most important parts are:

  • A summary description of the residuals (look to see that the median is close to zero, and the 1st and 3rd quantiles are approximately equal)
  • The value, standard error and significance (\(p\)-values) of the coefficients
  • The amount of variance explained (R-squared)

As a very quick interpretation, we have an intercept of about 26.6 and a slope of about 0.96. The slope being close to 1 suggests that median income increases by approximately $1,000 for every one-percent increase in higher education. The R2 is around 0.27, indicating that 27% of the variation in income is explained by our model.
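Rather than reading these numbers off the printed summary, the fitted model object can be queried directly; confint() gives confidence intervals for the coefficients:

```r
coef(fit_lm1)               # intercept and slope
confint(fit_lm1)            # 95% confidence intervals for the coefficients
summary(fit_lm1)$r.squared  # the R-squared value on its own
```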

We can visualize this model by using ggplot2’s geom_smooth() function to add a regression line to a scatter plot:

ggplot(georgia, aes(x = PctBach, y = MedInc000)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

(This shows, again, the uneven distribution of values.)

You can add more covariates to the model by extending the right-hand side of the formula. Here we’ll add the percent elderly and percent foreign born to explore their relationship with income:

fit_lm2 <- lm(MedInc000 ~ PctBach + PctEld + PctFB, data = georgia) 
summary(fit_lm2)
## 
## Call:
## lm(formula = MedInc000 ~ PctBach + PctEld + PctFB, data = georgia)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.979  -4.789  -0.922   2.712  34.805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.5661     3.5282  14.616  < 2e-16 ***
## PctBach       0.7980     0.1479   5.397 2.49e-07 ***
## PctEld       -1.7890     0.2312  -7.738 1.21e-12 ***
## PctFB        -1.9020     0.6914  -2.751  0.00665 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.672 on 155 degrees of freedom
## Multiple R-squared:  0.4771, Adjusted R-squared:  0.4669 
## F-statistic: 47.13 on 3 and 155 DF,  p-value: < 2.2e-16

The new covariates have negative coefficients, indicating that income declines as these increase in Georgia. The R\(^2\) has increased to approximately 0.47.
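Because fit_lm1 is nested inside fit_lm2, an F-test with anova() (or a comparison of AIC values) can check whether the two extra covariates genuinely improve the model; a quick sketch:

```r
# F-test for the additional covariates (PctEld and PctFB)
anova(fit_lm1, fit_lm2)

# Or compare information criteria: lower AIC indicates the better-fitting model
AIC(fit_lm1, fit_lm2)
```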

Geographically weighted regression

Regression models (and most other models) are aspatial and are commonly referred to as global models, as the model coefficients are assumed to hold true everywhere. In reality, this assumption of spatial invariance is violated in many instances when geographic phenomena are considered. For example, the association between PctBach and MedInc might be different in different places. To assess this, we can use geographically weighted regression or GWR. This estimates a series of local models, one per spatial location. Each model is built on a subset of the data: a set of locations in a window around the location of interest. As a result, the coefficients vary in space. The choice of window size is important, as it dictates the number of observations used in each local model, and so the quality of that model. While there have been several papers criticizing this approach as a complete modeling method, it is very useful for exploring potential variation in the relationship between the dependent and independent variables.

We now build the GWR model using the spgwr package. Building a GWR model requires two steps: first, assessing the best window size, and second, building and diagnosing the local models using this window. The window can be chosen as:

  • Fixed size: each window will have the same bandwidth or size, so models in data sparse areas will have fewer observations
  • Adaptive: rather than setting a single window size, the window for each model is chosen to capture the same number of observations

Here, we will use the second method, by setting the parameter adapt = TRUE in the bandwidth selection function. We first need to extract the polygon centroids for use in the distance calculations (these will be used to select the locations within a window around a point of interest).

library(spgwr)

georgia_crds = st_coordinates(st_centroid(georgia))
plot(st_geometry(georgia))
points(georgia_crds, pch = 16)

The gwr.sel function can be used to work out the optimum window size. It does this by removing a subset of the locations and testing how well the remaining locations can predict income for the subset. This is iterated across a set of bandwidths until the optimum is found.

gwr_bw = gwr.sel(MedInc000 ~ PctBach + PctEld + PctFB, 
        coords = georgia_crds, data = georgia, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 8454.503 
## Adaptive q: 0.618034 CV score: 9109.154 
## Adaptive q: 0.236068 CV score: 7977.137 
## Adaptive q: 0.145898 CV score: 7762.359 
## Adaptive q: 0.09016994 CV score: 7714.123 
## Adaptive q: 0.07639362 CV score: 7673.659 
## Adaptive q: 0.04721385 CV score: 7913.753 
## Adaptive q: 0.06524794 CV score: 7616.704 
## Adaptive q: 0.05835953 CV score: 7715.613 
## Adaptive q: 0.06950521 CV score: 7596.978 
## Adaptive q: 0.06901448 CV score: 7594.047 
## Adaptive q: 0.0681993 CV score: 7599.133 
## Adaptive q: 0.06894048 CV score: 7594.512 
## Adaptive q: 0.06912234 CV score: 7593.367 
## Adaptive q: 0.06926858 CV score: 7594.05 
## Adaptive q: 0.06916303 CV score: 7593.11 
## Adaptive q: 0.06920372 CV score: 7593.25 
## Adaptive q: 0.06916303 CV score: 7593.11
gwr_bw
## [1] 0.06916303

As this is an adaptive bandwidth, the optimum value (0.069) indicates the proportion of counties that are included in each local model (about 7%). We can now use this to estimate the full set of local models (one per county):

fit_gwr = gwr(MedInc000 ~ PctBach + PctEld + PctFB, 
              coords = georgia_crds, data = georgia, adapt = gwr_bw)

fit_gwr
## Call:
## gwr(formula = MedInc000 ~ PctBach + PctEld + PctFB, data = georgia, 
##     coords = georgia_crds, adapt = gwr_bw)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.06916303 (about 10 of 159 data points)
## Summary of GWR coefficient estimates at data points:
##                   Min.   1st Qu.    Median   3rd Qu.      Max. Global
## X.Intercept. 38.933286 51.703499 56.530763 63.546895 78.252407 51.566
## PctBach      -0.095338  0.358524  0.554166  0.884791  1.508898  0.798
## PctEld       -3.265517 -2.397512 -2.050739 -1.707752 -1.117719 -1.789
## PctFB        -6.921809 -3.515077 -1.845325 -0.649226  1.331465 -1.902

The output of the model now contains a range of coefficient estimates for each variable, showing how much these vary across Georgia. We can also visualize this by extracting the values for each county. These are held in an object called SDF in the output of the gwr() function:

fit_gwr$SDF
## class       : SpatialPointsDataFrame 
## features    : 159 
## extent      : 949905.4, 1392468, -674547.7, -217623.8  (xmin, xmax, ymin, ymax)
## crs         : NA 
## variables   : 8
## names       :            sum.w,      (Intercept),             PctBach,            PctEld,             PctFB,             gwr.e,             pred,           localR2 
## min values  : 15.0109625667414, 38.9332861313754, -0.0953381517537397, -3.26551722658842, -6.92180900589457, -16.9322238406029, 20.2667549521815, 0.553666131713563 
## max values  : 29.0794134535373, 78.2524067679104,    1.50889803811775, -1.11771940077526,  1.33146450846935,   27.466910234057,  64.731234060467, 0.765569568212537

We can attach these back to the georgia sf object and make some maps:

georgia$b_PctBach = fit_gwr$SDF$PctBach
georgia$b_PctEld = fit_gwr$SDF$PctEld
georgia$b_PctFB = fit_gwr$SDF$PctFB
georgia$localR2 = fit_gwr$SDF$localR2

The last line extracts the local R2 (the R2 for each model). Now let’s map some of these values:

  • Local R2
tm_shape(georgia) +
  tm_fill( "localR2", palette = "viridis", style = "kmeans") +
  tm_layout(legend.position = c("right","top"), frame = F)

  • Percent elderly
tm_shape(georgia) +
  tm_fill( "b_PctEld", palette = "viridis", style = "kmeans") +
  tm_layout(legend.position = c("right","top"), frame = F)

This suggests that the relationship between percent elderly and income is much stronger around the metropolitan regions of the state.

  • Percent Bachelors and Foreign born
tm_shape(georgia) +
  tm_fill(c("b_PctFB", "b_PctBach"),midpoint = 0, style = "kmeans") +
  tm_style("col_blind")+
  tm_layout(legend.position = c("right","top"), frame = F) 
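spgwr can also return standard errors for the local coefficients if the model is refit with se.fit = TRUE; a rough pseudo t-value (coefficient divided by its standard error) then highlights where a relationship is locally distinguishable from zero. A sketch - the "_se" column name is how spgwr typically labels these, but check names(fit_gwr_se$SDF) on your own output:

```r
# Refit, asking spgwr for standard errors of the local coefficients
fit_gwr_se <- gwr(MedInc000 ~ PctBach + PctEld + PctFB,
                  coords = georgia_crds, data = georgia,
                  adapt = gwr_bw, se.fit = TRUE)

# Pseudo t-values for the percent elderly coefficient
georgia$t_PctEld <- fit_gwr_se$SDF$PctEld / fit_gwr_se$SDF$PctEld_se
tm_shape(georgia) +
  tm_fill("t_PctEld", palette = "-RdBu", midpoint = 0) +
  tm_borders("lightgray")
```

Values beyond roughly ±2 suggest locations where the local coefficient is reasonably well distinguished from zero, though these should be treated as exploratory rather than formal tests.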

Machine Learning in R

Machine learning is a set of algorithms and methods that can identify patterns in data and predict from these patterns. This has a number of uses with spatial data, including predicting land cover or species occurrences, or with remote sensing data for segmentation and classification. Here, we’ll use a set of landslide occurrences from southern Ecuador to make a map of landslide risk. We’ll use a machine learning approach to examine what covariates are linked with landslides or absences of landslides, and then use a raster of these values to produce a full risk map. This example is modified from Robin Lovelace’s Geocomputation with R book.

You’ll need the following files to run this lab, so make sure you have them downloaded from the Google drive.

  • lsl.csv: a set of landslide locations, with associated terrain variables
  • ta.tif: a raster file with terrain variables for prediction
  • mask.shp: a study area mask

You’ll also need to install a package to help with machine learning (mlr3). This is a meta-package, used to help streamline the machine learning process by providing a standard interface to a large number of algorithms, as well as functions to test models. It comes as a set of packages, so the easiest approach is to install the whole set with install.packages("mlr3verse"). We’ll also be using the pROC package, so install that as well. Once installed, load the libraries we’ll need:

library(terra)
library(dplyr)
library(pROC)
library(mlr3verse)

Next, we’ll load the landslide occurrences:

lsl <- read.csv("./data/lsl.csv")
head(lsl)
##          x       y lslpts    slope        cplan        cprof     elev
## 1 713887.7 9558537  FALSE 33.75185  0.023180449  0.003193061 2422.810
## 2 712787.7 9558917  FALSE 39.40821 -0.038638908 -0.017187813 2051.771
## 3 713407.7 9560307  FALSE 37.45409 -0.013329108  0.009671087 1957.832
## 4 714887.7 9560237  FALSE 31.49607  0.040931452  0.005888638 1968.621
## 5 715247.7 9557117  FALSE 44.07456  0.009686948  0.005149810 3007.774
## 6 714927.7 9560777  FALSE 29.85981 -0.009047707 -0.005738329 1736.887
##   log10_carea
## 1    2.784319
## 2    4.146013
## 3    3.643556
## 4    2.268703
## 5    3.003426
## 6    3.174073

This has a set of variables derived from a DEM that we will use to build the model:

  • slope: slope angle (degrees)
  • cplan: plan curvature (rad m^-1), expressing the convergence or divergence of a slope and thus water flow
  • cprof: profile curvature (rad m^-1), a measure of flow acceleration, also known as the downslope change in slope angle
  • elev: elevation (m a.s.l.), representing the different altitudinal zones of vegetation and precipitation in the study area
  • log10_carea: the decadic logarithm of the catchment area (log10 m^2), representing the amount of water flowing towards a location

As we need to be able to contrast where landslides have occurred with where they have not, the file contains a column lslpts that indicates for each location whether there was a landslide (TRUE) or whether it is a background point without a landslide (FALSE). We will need to make this a factor - a particular R data type used for categorical data.

lsl <- lsl %>% 
  mutate(lslpts_f = factor(lslpts))

We’ll also use this dataframe to create an sf object to see the locations:

lsl_sf <- st_as_sf(lsl, coords = c("x", "y"),
                   crs = 32717)

Next, we’ll read in the mask data, and map the locations within the mask:

lsl_mask <- st_read("./data/mask/study_mask.shp")
## Reading layer `study_mask' from data source 
##   `/Users/u0784726/Dropbox/Data/devtools/ugic-intro-r/data/mask/study_mask.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 712112.8 ymin: 9556864 xmax: 715794 ymax: 9560905
## Projected CRS: WGS 84 / UTM zone 17S
tm_shape(lsl_mask) +
  tm_borders() +
  tm_shape(lsl_sf) +
  tm_symbols(col = "lslpts", size = 0.75, alpha = 0.7)

You should be able to see that the landslides (yellow) tend to cluster in the center of the study region.

Next, read in the raster data using terra. This has raster layers with the same variables as in the lsl object. Note that it is important that the names of these layers match the column names in the dataframe exactly for prediction. We’ll also use the mask to crop out the study area:

ta = rast("./data/ta.tif")
ta = mask(ta, mask = lsl_mask)
plot(ta)

If we add the locations to the elevation layer, you can see that the landslides tend to occur at mid-elevations.

tm_shape(ta['elev']) +
  tm_raster(palette = '-Greens', style = 'cont') +
  tm_shape(lsl_mask) +
  tm_borders() +
  tm_shape(lsl_sf) +
  tm_symbols(col = "lslpts", size = 0.5, alpha = 0.7) +
  tm_layout(legend.outside = TRUE)

Terrain characteristics in terra

This is an optional section. In this dataset, the terrain characteristics (slope, aspect, etc.) were precalculated. The terra package has a function, terrain(), to estimate these from a DEM. The following code estimates the slope and aspect in radians, then uses the shade() function to create a hillshading map. This has arguments to position the light source (angle sets the sun’s elevation above the horizon, direction its azimuth):

slope <- terrain(ta['elev'], "slope", unit = 'radians')

aspect <- terrain(ta['elev'], "aspect", unit = 'radians')

hs <- shade(slope, aspect, angle = 45, direction = 90)

We can then use this as a base map in tmap, and overlay another raster layer with a transparency (alpha):

tm_shape(hs) +
  tm_raster(palette = "Greys", legend.show = FALSE) +
  tm_shape(ta['elev']) +
  tm_raster(palette = '-Greens', alpha = 0.35, legend.show = FALSE)

Modeling

We’ll start by building two relatively simple models to predict landslide risk. We’ll use a logistic regression model (similar to the linear model above, but designed for binary outcomes), and a random forest, a widely used machine learning method. For each of these, we’ll do three things:

  • Build the model
  • Assess the model predictions using the AUROC score
  • Predict using the values in the ta raster object to get a map of risk

Logistic regression

Logistic models can be fit in R using the glm() function. This is designed for regression models that don’t have continuous outcomes (in contrast to lm() that we used above). As before, we need to define:

  • The formula that relates a dependent variable to the independent(s)
  • The dataset containing the variables

We also need to define a family. For binary outcomes, this needs to be set to binomial:

fit_glm <- glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
               family = binomial(),
               data = lsl)

If you’d like to see the output of the model (e.g. coefficients), run summary(fit_glm). I’m skipping this step here.

Next, we’ll test the prediction using the AUROC. This stands for Area Under the Receiver Operating Characteristic curve. The full explanation is a little complex, but essentially we use the model to predict a risk value for each of the observed points with fitted(). The AUROC compares these to the observed presence or absence of landslides. AUROC values can range from 0 to 1, but we really need values above 0.5 to indicate a reasonable prediction (a value of 0.5 indicates that the model is no better than flipping a coin!).

First, get predicted values for each location:

lsl_pred <- fitted(fit_glm)
head(lsl_pred)
##          1          2          3          4          5          6 
## 0.19010373 0.11718500 0.09519487 0.25030946 0.33819463 0.15754547

You’ll see here that the predictions are made as probabilities (i.e. in the range [0, 1]). Now, let’s estimate the AUROC:

auc(roc(lsl_sf$lslpts, lsl_pred))
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.8216

This gives a value of about 0.82, which indicates pretty decent skill. Now, let’s predict across the raster to get our risk map. A note here: we want to use the predict() function from the terra package, as this will return the predictions as a raster layer. To force this, we prepend the name of the package (terra::) before the function.

pred <- terra::predict(ta, model = fit_glm, type = "response")

Finally, we can make our map:

m_glm <- tm_shape(pred) +
  tm_raster() +
  tm_shape(lsl_sf) +
  tm_dots(size = 0.2) +
  tm_layout(main.title = "GLM")

m_glm

Darker shades on this map indicate areas with higher risks.
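If a binary susceptible/not-susceptible map is more useful than a continuous one, the probability raster can be thresholded directly with terra’s comparison operators; the 0.5 cutoff below is purely illustrative, and an appropriate threshold would depend on the application:

```r
# Classify cells as high risk where predicted probability exceeds 0.5
# (an arbitrary, illustrative threshold)
high_risk <- pred > 0.5
plot(high_risk)
```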

Random forest

Now let’s repeat these steps with a random forest. There are several packages for this, but we’ll use the randomForest package, which is the oldest and most stable. You should see that while we are using a different package and function to build the model, the formula syntax is exactly the same as before. You’ll probably get a warning, which we’ll ignore for now.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit_rf <- randomForest(lslpts ~ slope + cplan + cprof + elev + log10_carea,
                       data = lsl)

Now estimate the AUROC. We get a small improvement here, from 0.82 to about 0.85.

lsl_pred <- predict(fit_rf)
auc(roc(lsl_sf$lslpts, lsl_pred))
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.8485

And predict on the raster:

pred = terra::predict(ta, model = fit_rf)

m_rf <- tm_shape(pred) +
  tm_raster() +
  tm_shape(lsl_sf) +
  tm_dots(size = 0.2) +
  tm_layout(main.title = "Random Forest")
m_rf

tmap_arrange(m_glm, m_rf)

Cross-validation

In the previous section, we built and tested two models. However, the AUROC values that we used to compare them are not a good estimate of predictive skill. The reason for this is that we used the same set of data to build and then test each model - the test is not an independent assessment. To get around this, we can use a cross-validation approach to model testing. In this, the dataset is split into two: a training set used to build the model, and a testing set used to, well, test it. As the test set is not used to train or build the model, it is considered independent (the model does not ‘see’ these data), and the resulting test is a better estimate of predictive skill.
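As a concrete illustration of the idea (before letting a helper package handle it), a single hold-out split can be coded by hand in a few lines; the 70/30 split proportion here is an arbitrary choice:

```r
set.seed(42)  # for a reproducible split

# Randomly assign 70% of rows to training, the rest to testing
train_idx <- sample(nrow(lsl), size = round(0.7 * nrow(lsl)))
lsl_train <- lsl[train_idx, ]
lsl_test  <- lsl[-train_idx, ]

# Fit on the training set only, then score on the held-out set
fit_cv <- glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
              family = binomial(), data = lsl_train)
test_pred <- predict(fit_cv, newdata = lsl_test, type = "response")
auc(roc(lsl_test$lslpts, test_pred))
```

The AUROC from this independent split will typically be a little lower (and more honest) than the in-sample value.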

While you can set up and run cross-validation by hand, it’s generally easier to do this with a helper package. We’ll use mlr3, which you installed and loaded earlier. We need to define a series of things to make this work:

A task

This defines the dataset, the outcome or target variable, and the independent variables or features used to model the outcome. At a minimum, this needs:

  • The backend: the dataframe or object holding the data
  • The target: the variable in the backend that we want to predict

The function to create this is TaskClassif. This is designed for binary or categorical outcomes. There is also a TaskRegr for continuous outcomes.

lsl_task <- TaskClassif$new(id = "lsl", backend = lsl, 
                            target = "lslpts_f")

We can check how this is set up (i.e. what is the target and the features/variables):

lsl_task$col_roles
## $feature
## [1] "cplan"       "cprof"       "elev"        "log10_carea" "lslpts"     
## [6] "slope"       "x"           "y"          
## 
## $target
## [1] "lslpts_f"
## 
## $name
## character(0)
## 
## $order
## character(0)
## 
## $stratum
## character(0)
## 
## $group
## character(0)
## 
## $weight
## character(0)
## 
## $always_included
## character(0)

Note that this includes the coordinates and the original landslide labels, so we’ll exclude these so they are not used in the model:

lsl_task$col_roles$feature <- setdiff(lsl_task$col_roles$feature,
                                      c("x", "y", "lslpts"))
lsl_task$feature_names
## [1] "cplan"       "cprof"       "elev"        "log10_carea" "slope"

A measure

This is the score that is used to assess how well the model predicts, and is defined using msr(). We’ll use two here: the AUROC (described above) and the accuracy (the proportion of points that are correctly predicted):

msr_auc = msr("classif.auc")
msr_acc = msr("classif.acc")

If you want to see the full set of measures, just run the following code:

mlr_measures

A resampling

This is the cross-validation strategy: the way in which the data will be split into training and testing sets, defined using rsmp(). We’ll use a \(k\)-fold cross-validation (cv). This splits the data \(k\) times. So for a 5-fold CV, the data are split into 80% training and 20% testing, and this is then repeated 4 more times. This ensures that all datapoints are used exactly once in a testing set and gives an exhaustive evaluation.

rsmp_cv = rsmp("cv", folds = 5)

If you want to see the full set of resampling strategies, just run the following code:

mlr_resamplings

A learner

The machine learning algorithm to be used, this is defined using lrn(). We’ll use a logistic model again (classif.log_reg):

lrn_glm = lrn("classif.log_reg", 
              predict_type = "prob")

If you want to see the full set of learners, just run the following code:

mlr_learners

Running the cross-validation

Now all this is defined, we can actually run the cross-validation and get the results. The function to run this is called resample(), and has arguments for the task, the learner and the cross-validation.

rr_glm = resample(lsl_task, 
                  lrn_glm, 
                  rsmp_cv, 
                  store_models = TRUE)
## INFO  [17:23:58.413] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 1/5)
## INFO  [17:23:58.437] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 2/5)
## INFO  [17:23:58.446] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 3/5)
## INFO  [17:23:58.451] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 4/5)
## INFO  [17:23:58.457] [mlr3] Applying learner 'classif.log_reg' on task 'lsl' (iter 5/5)

This should run pretty quickly, and all the results are stored in rr_glm. We asked the function to store the individual models, and you can access them (if you need to) with rr_glm$learners.

You can now get the average AUROC and accuracy for the 5 models with:

rr_glm$aggregate(c(msr_auc, msr_acc))
## classif.auc classif.acc 
##   0.8218499   0.7400000

You can also see the values for the individual folds with:

rr_glm$score(msr_auc)
##    task_id      learner_id resampling_id iteration classif.auc
##     <char>          <char>        <char>     <int>       <num>
## 1:     lsl classif.log_reg            cv         1   0.7624898
## 2:     lsl classif.log_reg            cv         2   0.8026316
## 3:     lsl classif.log_reg            cv         3   0.8273026
## 4:     lsl classif.log_reg            cv         4   0.8503289
## 5:     lsl classif.log_reg            cv         5   0.8664966
## Hidden columns: task, learner, resampling, prediction

Changing learners

This all probably seems like a lot of effort to get to this point, but now we have everything set up, it is very easy to test different machine learning algorithms. All we have to do is define a new learner (here a random forest), re-run the resampling and check the resulting scores:

lrn_rf = lrn("classif.ranger", 
             predict_type = "prob", 
             importance = "permutation")

rr_rf = resample(lsl_task, 
                  lrn_rf, 
                  rsmp_cv, 
                  store_models = TRUE)
## INFO  [17:23:58.560] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 1/5)
## INFO  [17:23:58.643] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 2/5)
## INFO  [17:23:58.706] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 3/5)
## INFO  [17:23:58.772] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 4/5)
## INFO  [17:23:58.833] [mlr3] Applying learner 'classif.ranger' on task 'lsl' (iter 5/5)
rr_rf$aggregate(c(msr_auc, msr_acc))
## classif.auc classif.acc 
##   0.8637583   0.7828571

Our final AUROC score for the random forest is 0.864. These scores are overall similar to the results from the non-independent test earlier, but are now a much more robust estimate of predictive skill.
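mlr3 can also run both learners through the same resampling in one call with benchmark(), which makes side-by-side comparison easier; a sketch, assuming the objects defined above are still in the workspace:

```r
# Build a design pairing the task with both learners and the 5-fold CV,
# then run all combinations and aggregate the scores
design <- benchmark_grid(tasks = lsl_task,
                         learners = list(lrn_glm, lrn_rf),
                         resamplings = rsmp_cv)
bmr <- benchmark(design)
bmr$aggregate(c(msr_auc, msr_acc))
```

Note that because the folds are re-drawn, the scores will differ slightly from the resample() runs above.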

Final thoughts

These two exercises have introduced some basic statistical and machine learning methods in R, but they barely scratch the surface of what is possible. There are many other methods and approaches out there. Here’s a very short list of resources that may be helpful: